-
Notifications
You must be signed in to change notification settings - Fork 1
[feat][improvement] new arch/implementation for scaling existing nodepools #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #63 +/- ##
==========================================
- Coverage 74.19% 67.04% -7.16%
==========================================
Files 37 38 +1
Lines 2279 2640 +361
==========================================
+ Hits 1691 1770 +79
- Misses 452 552 +100
- Partials 136 318 +182 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the LKE provider to support concurrency-safe multi-node pool scaling with strict 1:1 NodeClaim→Linode mapping. Key changes include replacing pool-level tags with instance-level tags for claim ownership, implementing pool-scoped locking, adding retryable error types, and expanding test coverage for both Standard and Enterprise LKE tiers.
Changes:
- Replaces pool-level
NodeClaimtags with instance-level claim tags usingUpdateInstanceAPI - Introduces keyed mutex for pool-scoped concurrency control and claim/scale operations
- Adds explicit retryable error sentinels (
ErrNoClaimableInstance,ErrClaimFailed, etc.) and bounded retry logic - Expands fake Linode API to support tag filtering,
UpdateInstance, andDeleteLKENodePoolNode
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/providers/lke/lke.go | Core LKE provider refactor: claim-or-scale flow, instance-level tagging, pool locking, tier-specific logic |
| pkg/utils/keyedmutex.go | New keyed mutex for pool-scoped synchronization |
| pkg/utils/utils.go | Removed custom Filter type, added GetInstanceTagsForLKE, GetTagValue, switched to linodego.Filter |
| pkg/fake/linodeapi.go | Enhanced fake API: tag filtering, UpdateInstance, DeleteLKENodePoolNode support |
| pkg/providers/lke/suite_test.go | Expanded LKE tests for Standard/Enterprise tiers: idempotency, claiming, scaling, error paths |
| pkg/controllers/nodeclaim/garbagecollection/suite_test.go | Added LKE-specific GC tests for both tiers |
| pkg/providers/instance/types.go | Removed pool-specific fields from Instance type |
| pkg/linode/sdk.go | Extended LinodeAPI interface with UpdateInstance, DeleteLKENodePoolNode |
| pkg/operator/operator.go | Updated to pass cluster name to LKE provider and use linodego.Filter |
| docs/lke-nodepool-scaleup.md | New design doc detailing architecture, invariants, flow, and API call analysis |
Comments suppressed due to low confidence (1)
pkg/providers/lke/suite_test.go:1
- Corrected spelling of 'malformatedtag' to 'malformedtag'.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
pkg/providers/lke/suite_test.go:1
- Corrected field name from 'ClusterName' to 'ClusterID' to match the actual parameter type.
/*
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add design document describing the core model, invariants, and flow for scaling existing LKE node pools in LKE mode. Covers 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs standard), concurrency model with keyed mutex, claim-or-scale loop with 15s timeout, and error handling policy.
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
This commit introduces comprehensive LKE NodePool scale-up functionality with proper concurrency control and extensive test coverage. Major Changes: - Add LKE NodePool scale-up capability to support existing NodePool expansion - Implement KeyedMutex utility for per-key synchronization to prevent race conditions - Enhance LKE provider with cluster name tracking and operation timeouts - Add comprehensive test suites for garbage collection and LKE operations
Remove the `v1` import alias for the `v1alpha1` package throughout the codebase. Use the full `v1alpha1` package name directly to avoid confusion with the Karpenter v1 API. Update all references in cloudprovider, controllers, drift detection, and utility packages. Also fix incorrect finalizer constant reference from `v1.TerminationFinalizer` to `karpv1.TerminationFinalizer` in nodeclass controller.
Add explicit `ErrNodesProvisioning` error to handle nodes with zero InstanceID in standard tier. Check all nodes before returning error to prioritize finding claimable instances. Add 500ms sleep and retry when nodes are still provisioning. Remove redundant type declaration in fake client.
Extract LKE pool lookup into `findLKENodePoolFromLinodeInstanceID` helper to reduce duplication in Delete method. Remove redundant instance fetching, tag parsing, and pool mutex locking from Delete - all now handled by the helper. Use `findNodeInPool` helper for node lookup instead of inline loop. Update simple example to add 30s consolidation period. Reduce default Karpenter replicas from 2 to 1 in chart values.
…tag in LKE mode Replace custom `lke-pool-id` tag with Linode's automatic `nodepool=<id>` tag for pool identification in enterprise tier. Use `LKEClusterID` field on instances for cluster association instead of tags. Update instance listing to filter by cluster ID field. Simplify pool resolution logic by using native Linode tags. Update fake client to set `LKEClusterID` on created instances. Remove pool ID tag from instance claim flow. Fixed major issues with LKE enterprise creation flow (no more duplicate nodes being created due to API timing). Also fixed List() behaviour that was throwing a lot of errors.
- Instance model cleanup: Remove NodeID, Labels, Taints, PoolID from Instance struct; simplify NewLKEInstance constructor. Reduces per-instance memory and eliminates unused pool-derived fields. These field were not being used at all by the instanceToNodeClaim() consumer. - API call reduction: Eliminate pool lookups in Get/List/hydrateInstanceFromLinode; use direct instance data. Cuts ListInstances + GetLKENodePool calls per operation. Now we just do 1 API call. - Enterprise test parity: Add comprehensive tests for enterprise-tier garbage collection with proper tag-based discovery and provider setup. Did the same comprehensive parity test case additions for lke provider tests. - Test suite restructuring: Reorganize LKE suite tests under explicit Standard/Enterprise tier contexts with consistent subcontexts (Create, Idempotency, Multi-node scaling, Tag verification, Error paths, Create options, Get, List, Delete, CreateTags). Remove duplicate top-level Create options and pool filtering tests. - Documentation: Update LKE exisiting nodepool scaleup docs with clarified tier behaviours, API call budgets, and scalability notes (more info on potential rate limit we can hit).
…helper extraction. Reduced the linter complexity. Extract instance type resolution, existing instance lookup, and pool locking into separate helper methods. Introduce explicit error types (ErrNoClaimableInstance, ErrClaimFailed, ErrPoolScaleFailed) to distinguish retry-able conditions from fatal errors. Replace inline pool mutex lock/unlock with withPoolLock helper for cleaner resource management. Simplify attemptCreate by removing nodeID return value and using instance
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Add ProviderConfig struct to LKE provider with CreateDeadline, TagVerificationTimeout, and RetryDelay fields. Add corresponding CLI flags and environment variables (LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, LKE_RETRY_DELAY) with defaults matching previous hardcoded values (10s, 4s, 2s). Thread config through operator initialization and test setup. Extract retry error check into isRetryableCreateError helper. Replace hardcoded sleep
Add lkeCreateDeadline, lkeTagVerificationTimeout, and lkeRetryDelay settings to chart values with defaults (10s, 4s, 2s). Wire settings through deployment template as LKE_CREATE_DEADLINE, LKE_TAG_VERIFICATION_TIMEOUT, and LKE_RETRY_DELAY environment variables. Add documentation comments for each timeout parameter.
9985030 to
6534d08
Compare
Add retryable error checks to Create and verifyTagsApplied loops using utils.IsRetryableError. Distinguish between retryable and fatal errors in verifyTagsApplied by returning wrapped error for non-retryable cases. Update Get method to use context parameter instead of Background() and distinguish NotFound errors from other API failures. Expand IsRetryableError to include timeout and EOF errors (copied from CAPL)
Add design document describing the core model, invariants, and flow for scaling existing LKE node pools in LKE mode. Covers 1:1 NodeClaim-to-VM mapping, tier-specific behavior (enterprise vs standard), concurrency model with keyed mutex, claim-or-scale loop with 15s timeout, and error handling policy.
Add golangci-lint as a managed tool in Makefile with version v2.8.0. Include it in verify target and tools phony target. Add renovate configuration for automated version updates.
Replace colon (':') with equals sign ('=') as the delimiter in tag formatting throughout LKE provider implementation and tests. Update tag creation in instance provider, tag verification logic, pool tag generation, and all test assertions to use 'key=value' format instead of 'key:value' format.
Remove custom KeyedMutex implementation from pkg/utils and replace with keymutex.KeyMutex from k8s.io/utils/keymutex. Update LKE provider to use keymutex.NewHashed(0) and LockKey/UnlockKey methods instead of custom Lock/Unlock methods.
AshleyDumaine
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, tested locally since Github is partially on fire.
Summary
This PR implements a complete, concurrency‑safe LKE node pool scale‑up flow for Karpenter. It enforces a strict 1:1 NodeClaim→Linode mapping, adds retryable error sentinels to distinguish transient failures, and brings Standard and Enterprise LKE tiers to parity. It also expands the fake Linode API to support more realistic tag filtering and node pool operations, adds LKE‑specific GC tests, and documents the full design.
What changed (by area)
LKE Provider
ErrNoClaimableInstance,ErrClaimFailed,ErrPoolScaleFailed,ErrNodesProvisioning) and a bounded retry loop.(nodepool, instanceType).Concurrency / Architecture
pkg/utils/keyedmutex.goFake Linode API
pkg/fake/linodeapi.goLinode SDK Interface
pkg/linode/sdk.goGarbage Collection
pkg/controllers/nodeclaim/garbagecollection/suite_test.goDocs + Examples
docs/lke-nodepool-scaleup.mdexamples/v1/simple.yamlconsolidateAfter: 30sin default NodePool example.verify.Architectural Notes
Tests
make testmake verifyWhy this matters
This makes LKE mode robust for existing pool scale‑up, reduces race conditions, and improves observability and test coverage across both LKE tiers.